Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos

Authors

De-An Huang

Shyamal Buch

Lucio Dery

Animesh Garg

Li Fei-Fei

Juan Carlos Niebles

Abstract

Grounding textual phrases in visual content with standalone image-sentence pairs is a challenging task. When we consider grounding in instructional videos, this problem becomes profoundly more complex: the latent temporal structure of instructional videos breaks independence assumptions and necessitates contextual understanding for resolving ambiguous visual-linguistic cues. Furthermore, the need for dense annotations and the scale of video data make supervised approaches prohibitively costly. In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision. We introduce the visually grounded action graph, a structured representation capturing the latent dependency between grounding and references in video. For optimization, we propose a new reference-aware multiple instance learning (RA-MIL) objective for weak supervision of grounding in videos. We evaluate our approach on unconstrained videos from YouCookII and RoboWatch, augmented with new reference-grounding test set annotations. We demonstrate that our jointly optimized, reference-aware approach simultaneously improves visual grounding, reference resolution, and generalization to unseen instructional video categories.

Similar resources

Knowledge Aided Consistency for Weakly Supervised Phrase Grounding

Given a natural language query, a phrase grounding system aims to localize mentioned objects in an image. In weakly supervised scenario, mapping between image regions (i.e., proposals) and language is not available in the training set. Previous methods address this deficiency by training a grounding system via learning to reconstruct language information contained in input queries from predicte...

Unsupervised Learning and Segmentation of Complex Activities from Video

This paper presents a new method for unsupervised segmentation of complex activities from video into multiple steps, or sub-activities, without any textual input. We propose an iterative discriminative-generative approach which alternates between discriminatively learning the appearance of sub-activities from the videos’ visual features to sub-activity labels and generatively modelling the temp...

Semi-supervised Learning for Identifying Players from Broadcast Sports Videos with Play-by-Play Information

Tracking and identifying players in sports videos filmed with a single moving pan-tilt-zoom camera has many applications, but it is also a challenging problem due to fast camera motions, unpredictable player movements, and unreliable visual features. Recently, [26] introduced a system to tackle this problem based on conditional random fields. However, their system requires a large number of lab...

Grounding Action Descriptions in Videos

Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general purpose corpus that aligns high qualit...

Context-aware visual analysis of elderly activity in a cluttered home environment

This paper presents a semi-supervised methodology for automatic recognition and classification of elderly activity in a cluttered real home environment. The proposed mechanism recognizes elderly activities by using a semantic model of the scene under visual surveillance. We also illustrate the use of trajectory data for unsupervised learning of this scene context model. The model learning proce...

Publication date: 2018
